Dynamic multilayer growth: Parallel vs. sequential approaches

The decision of when to add a new hidden unit or layer is a fundamental challenge for constructive algorithms. It becomes even more complex in the context of multiple hidden layers. Growing both network width and depth offers a robust framework for capturing more information from the data and modeling more complex representations. In the context of multiple hidden layers, should growing units occur sequentially, with hidden units grown in only one layer at a time, or in parallel, with hidden units growing across multiple layers simultaneously? The effects of growing sequentially or in parallel are investigated using a population dynamics-inspired growing algorithm in a multilayer context. A modified version of the constructive growing algorithm capable of growing in parallel is presented. Sequential and parallel growth methodologies are compared in a three-hidden layer multilayer perceptron on several benchmark classification tasks. Several variants of these approaches are developed for a more in-depth comparison based on the type of hidden layer initialization and the weight update methods employed. Comparisons are then made to another sequential growing approach, Dynamic Node Creation. Growing hidden layers in parallel resulted in performance comparable to or higher than sequential approaches. Growing hidden layers in parallel promotes growing narrower deep architectures tailored to the task. Dynamic growth inspired by population dynamics offers the potential to grow the width and depth of deeper neural networks in either a sequential or parallel fashion.


Introduction
When deciding the architecture of an artificial neural network, a fundamental question arises: is it better to use a shallow and wide network or a deep and narrow one? This problem is commonly known as the trade-off between width and depth, where width refers to the number of units in a hidden layer and depth refers to the number of hidden and output layers [1].
Previously, it was shown that multilayer feedforward networks with a single hidden layer and a large enough width can approximate any continuous function, making these shallow networks universal approximators [2,3]. Despite this, the ability of deeper architectures to learn more complex, distributed, and sparse representations makes them more powerful than their shallow counterparts [1,4-6]. With increased network depth, the network can learn more abstract and complex representations, allowing it to discriminate inputs better [5,7]. This brings us back to the trade-off: is depth more valuable than width? Bianchini and Scarselli [4] used Betti numbers, a topological measure, to compare shallow and deep feedforward networks. They showed that a deep neural network with the same number of hidden units as its shallow counterpart can realize more complex functions. Eldan and Shamir [6] demonstrated that for a fully connected feedforward network with a linear output unit, "depth-even if increased by 1-can be exponentially more valuable than width for standard feedforward networks." With larger data sets and more powerful graphics processing units, the increased representational power and faster training of deep neural networks have made them valuable tools. Deep neural networks have been applied to many different problems, including computer vision, object detection, and electroencephalogram classification, to name a few (for reviews, see [5,8-10]).
A network's depth and width affect its ability to generalize. If the network architecture is too large, the model may overfit the data, causing poor generalization. Conversely, if the architecture is too small, the model may underfit the data, causing over-generalization [11-15]. This reinforces the need to find optimal architectures that match a given task's complexity. Currently, the best practice is to use fixed neural architectures found through trial and error, owing to the simplicity of implementation. However, this process can be time-consuming, and there is no guarantee that an optimal or near-optimal topology will be found [11,14,16,17].
An alternative to using fixed architectures found by trial and error is to use adaptive neural architectures. With adaptive neural architectures, the idea is to have a dynamic architecture that is updated during training [18]. Several strategies have been proposed and can be broadly classified into three categories: constructive algorithms, pruning algorithms, and hybrid methods.
Constructive algorithms involve starting with a small network architecture (typically one hidden unit) and gradually adding connections, units, or layers during training to match task complexity [18-20]. Conversely, pruning algorithms start with a larger network from which connections and units are pruned (removed) during the learning process to match task complexity (for surveys, see [21-24]). Hybrid methods combine constructive and pruning algorithms: they typically grow the neural architecture first and then prune the resulting architecture, or grow and prune simultaneously during learning [25]. While the hybrid approach is appealing and has had great success [16,26-31], the focus of the present article is on constructive algorithms.
Constructive algorithms offer the possibility of compact architectures as an alternative to the trial-and-error approach when designing architectures. The focus of constructive algorithms has primarily been on growing the width of a single hidden layer. Examples of applications with this focus include regression problems [32,33], classification [12,13,20,34-46], and image segmentation [47], to name a few (for more applications, see [19]). Conversely, approaches based on cascade correlation have offered a constructive approach that grows network depth instead of width. These approaches use a cascaded architecture where the network grows many hidden layers, each a single unit wide [31,48-52].
Growing both network width and depth offers a robust framework for capturing more information from the data and modeling more complex representations. There have been numerous approaches for determining when new hidden units and layers should be added [14,53-57]. For instance, a measure of network significance to quantify generalization power has been proposed for growing network width, combined with drift detection for growing depth [58,59]. Others have examined whether the error or loss has stopped changing by a predetermined threshold value. Baluja and Fahlman [60] proposed a sibling/descendant cascade correlation, where once the error stops changing, candidate units can be added to the same layer (siblings) or a new layer (descendants). The candidate pool is made of both types of units, and whichever unit reduces the residual network error the most is added to the network. Zemouri et al. [16] introduced the growing pruning deep learning neural network (GP-DLNN), which uses a penalty term that is compared to a threshold value: if the penalty is smaller than the threshold, a new unit is added, and if larger, a new layer is added. Similarly, the cascade-correlation growing deep learning neural network (CCG-DLNN), a constructive approach grounded in cascade correlation, uses the same criterion to decide whether candidate units should be added to the most recent layer or to a new layer [29]. Another approach is to set the maximum number of hidden units and layers in advance, as with the evolutionary algorithm for building deep neural networks [17,61]. This approach calculates and compares the error at each step to a threshold value. Growth is complete if the error converges or the maximum allowed structure is reached. Only if error convergence is not reached and the maximum number of hidden units for the current layer is reached will a new layer be added [17,61].
When employing constructive algorithms, several challenges have been identified and should be considered [16,25]:
• What are the criteria for adding a new unit or layer to the network?
• How to connect a newly added hidden unit?
• How should the weights be initialized?
• What training scheme should be used (re-train the whole network vs. freezing)?
• When to stop adding hidden units (what are the convergence criteria)?
Here, we focus on the first challenge: the criteria for adding a new unit or layer to the network. Specifically, we are interested in when a new hidden unit or layer should be added. In the context of multiple hidden layers, should growing units occur sequentially or in parallel across hidden layers? With a sequential growth methodology, we refer to only being able to grow hidden units in one hidden layer at a time. The examples of constructive algorithms that grow both width and depth mentioned previously fall under this methodology. Conversely, with a parallel growth methodology, we refer to being able to grow hidden units across multiple layers simultaneously. This methodology is scarcely found in the literature, with a modular network approach being the most evident. Guan and Li [62] used such an approach. They broke benchmark classification problems into simpler subproblems using task decomposition and output parallelism. Modules were then grown and trained in parallel to solve each subproblem. The resulting modules that could solve the subproblems were merged into a modular network.
Previously, we introduced a constructive growing algorithm that provides a more self-governed alternative for deciding when a new unit should be added to a single hidden layer multilayer perceptron [63]. This approach was inspired by population dynamics [64] and treated the hidden units as a population and the hidden layer as the environment they exist in. This endowed the hidden layer with a carrying capacity: the maximum population the hidden layer environment can sustain (an upper bound on the layer width). Population dynamics could then provide a hidden unit population growth rate for the hidden layer. Combining the carrying capacity and direct performance feedback from the network created a built-in, more self-governed method for growing the hidden layer while preventing unbounded growth. A natural extension of this work is to grow both width and depth. Since each hidden layer is treated as having its own hidden unit population, should the growth rate of each population be considered individually, in a sequential fashion, or simultaneously, in a parallel fashion?
We propose investigating the effects of growing sequentially or in parallel using our population dynamics-inspired growing algorithm in a multilayer context. To achieve this, we create a modified version of our constructive growing algorithm capable of growing in parallel. We then test the methodologies of sequential and parallel growth in a three-hidden layer multilayer perceptron on several benchmark classification tasks. We test several variants of these approaches, based on the hidden layer initialization and the training scheme, for a more detailed comparison. Comparisons are also made to another sequential growing approach, Dynamic Node Creation (DNC; [35]), which employs the common methodology of iteratively adding units when the error curve flattens within a specific time frame.
The remainder of the paper is divided as follows: Section II introduces the growing approaches employed (sequential growth, parallel growth, DNC growth), the network's architecture and variants, and the learning procedure. In Section III, we describe the results of several benchmark classification tasks. Finally, in Sections IV and V, we discuss and conclude the study's findings.

Growing
Sequential growth. The dynamics of the sequential growing algorithm are characterized by having only one hidden layer able to grow at a time. In this context, the network cannot grow the subsequent layer (ℓ+1) until the current layer (ℓ) has reached the maximum number of hidden units it can sustain, as dictated by the carrying capacity (C). A visualization of the sequential growing method's effect on growth rate and changes in hidden layer sizes for a three-hidden layer MLP is depicted in Fig 1. Sequential growth of the hidden layers is dictated by Eq (1). This algorithm was inspired by the single-species model with the Allee effect [64]. It is used to calculate the change in the size of the hidden unit population (i.e., the growth rate, dh_s/dt) for each hidden layer (ℓ). Incorporated into the algorithm is the global network error (E) as calculated by the loss function. This allows the network's performance to modulate the growth rate of the hidden unit population during learning. The result is that regardless of what the carrying capacity (C) is set to, the hidden unit population will grow based on the needs of the network. As such, the carrying capacity is defined as C ∈ ℕ | C ≥ 1. The addition of a new unit to the hidden layer only occurs when the growth reaches an integer value, as adding a fraction of a single unit is not plausible. The extinction constant (α) acts as a lower bound on the size of the hidden layer population. The population will go locally extinct and prune units if it falls below this threshold value. As this study aims to examine growth, α is set just below zero to remove the possibility of extinction.
where C is the carrying capacity of the hidden layer environment (upper bound), ℓ is the hidden layer number, h_s is the size of the hidden unit population, E is the current global error of the network, and α is the extinction constant (lower bound) set to -0.1.

Parallel growth. The dynamics of the parallel growing algorithm are characterized by having all hidden layers in the network growing simultaneously. In this context, the network can grow the current layer (ℓ) and all subsequent layers (ℓ…m) regardless of whether the carrying capacity (C) has been reached by any hidden layer. The growth rate of the first hidden layer (h_s^0) is calculated according to Eq (1), while the growth rate of every subsequent layer (h_s^ℓ) is calculated according to Eq (2). The growth rate of the first layer is independent of the size of any other layer; in contrast, each subsequent hidden layer is impacted by the size of the hidden layer that precedes it (h_s^{ℓ-1}). Taking the previous hidden layer's size into account allows for a more staggered growth rate for deeper layers. As the previous hidden layer increases in size and approaches the carrying capacity, beyond which it can no longer add new hidden units, the current layer's growth rate is dynamically increased to compensate for the continued need to add hidden units to match the task's complexity (see Fig 2). A visualization of the parallel growing method's effect on growth rate and changes in hidden layer sizes for a three-hidden layer MLP is depicted in Fig 3. In Eq (2), C is the carrying capacity of the hidden layer environment (upper bound), h_s^ℓ is the size of the hidden unit population of the current layer, h_s^{ℓ-1} is the size of the hidden unit population of the previous layer, E is the current global error of the network, and α is the extinction constant (lower bound) set to -0.1.
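Eqs (1) and (2) appear as display equations in the original article and are not reproduced here. To make the dynamics concrete, the following is a minimal Python sketch assuming an Allee-effect logistic form consistent with the description above: error-modulated logistic growth bounded by α and C, with deeper layers scaled by the occupancy of the preceding layer. The exact expressions and all function names are illustrative assumptions, not the published equations.

```python
def growth_rate_first_layer(h, E, C=100, alpha=-0.1):
    # Assumed Eq (1)-style form: the network error E modulates a
    # logistic growth term bounded below by the extinction constant
    # alpha and above by the carrying capacity C.
    return E * h * (1 - h / C) * (h - alpha)

def growth_rate_deeper_layer(h, h_prev, E, C=100, alpha=-0.1):
    # Assumed Eq (2)-style form: identical to Eq (1) but scaled by the
    # occupancy of the preceding layer, so deeper layers start slowly
    # and speed up as the previous layer approaches C.
    return E * h * (1 - h / C) * (h - alpha) * (h_prev / C)
```

Under this sketch, each layer would accumulate its growth rate across epochs and add a unit whenever a whole unit has accrued, consistent with the rule that units are only added at integer values.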
DNC growth. Iteratively adding new units with DNC occurs when the error curve flattens within a specific time frame (i.e., window width; [35]). This requires an a priori finely tuned trigger slope (Δ_T), which measures the error curve's flatness; a new unit is added if the error curve's slope falls below this value. Additionally, the user must finely tune the window width (z), the number of epochs over which the flatness of the error curve is compared to the trigger slope. If the error curve flattens out below the trigger slope (Δ_T), and if enough epochs have passed since the last addition, a new unit is added. This is determined by Eqs (3) and (4). The window width (z) was set to 1,000 epochs and the trigger slope (Δ_T) was set to 0.01 to promote the construction of smaller architectures. To allow a comparison in the context of multiple hidden layers, here, DNC follows a sequential growth methodology: the network cannot grow the subsequent layer (ℓ+1) until the current layer (ℓ) has reached the maximum number of hidden units it can sustain, as dictated by the carrying capacity (C).
where ρ^[t] is the average error at time t across all output units, z is the window width in epochs over which the slope is determined, and Δ_T is the trigger slope,

where t_0 is the time in epochs when the last unit was added to the hidden layer.
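The DNC criterion can be sketched as follows, under the assumption that Eqs (3) and (4) follow Ash's original formulation: the drop in the windowed average error, normalized by the error at the time of the last addition, is compared against the trigger slope, and at least a full window must have elapsed since the last addition. Variable names are ours.

```python
def dnc_should_grow(errors, t, t0, z=1000, delta_T=0.01):
    # errors[t]: average output error at epoch t; t0: epoch of the last
    # unit addition. Eq (4)-style guard: at least a full window must
    # have elapsed since the last addition.
    if t - t0 < z:
        return False
    # Eq (3)-style flatness test: the windowed error drop, normalized
    # by the error at the last addition, falls below the trigger slope.
    return abs(errors[t] - errors[t - z]) / errors[t0] < delta_T
```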

Architecture: A formal description
A multilayer perceptron (MLP) with m hidden layers is described by the number of units and the number of weight connections (see Fig 4). Let the size of layer ℓ at training epoch t be denoted by S_ℓ^[t], with the sizes of all the layers in the network at training epoch t defined by the vector ψ^[t]. The architecture of the network is thus represented as c^[t] = (S_0^[t], S_1^[t], …, S_{m+1}^[t]), with the weights of each layer collected in a matrix W^ℓ, where w_ij^ℓ is the weight connection between the i-th unit of layer ℓ and the j-th unit of layer ℓ-1.
The size of the input layer (d) and the output layer (k) are fixed and dependent on the given task. In contrast, the width and depth of the hidden layers (ψ^[t]) are permitted to grow. For all variants, the maximum number of hidden layers is defined by Max_ℓ and the maximum number of hidden units per hidden layer is dictated by the carrying capacity (C). To simplify comparisons, Max_ℓ was fixed at three hidden layers and C was fixed at 100 hidden units. The initialization condition gives rise to different network variants.
• In the first condition, the network is initialized with only one hidden layer containing a single hidden unit (First-Init variant), c^[0] = (S_0^[0], 1, 0, 0, S_{m+1}^[0]); see Fig 5a.
• Alternatively, the second condition initializes all three hidden layers with a single hidden unit each (All-Init variant), c^[0] = (S_0^[0], 1, 1, 1, S_{m+1}^[0]); see Fig 5b and the sketch after this list.
When adding a new unit under any of the growing approaches (sequential, parallel, or DNC), the new unit is added to the end of the hidden layer with all associated incoming and outgoing weight connections in a fully connected fashion. These weight connections are randomly initialized. During learning, the weights are updated via batch stochastic gradient descent with momentum under one of two possible conditions:
• All weights in the network are updated at each training epoch t (UAW variant); for a visualization, see Fig 6A.
• Only the incoming weights to the layer that is actively growing are updated, and all other weights in the network are frozen (F variant); for a visualization, see Fig 6B.
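As an illustration of the two initialization conditions above, the initial size vector c^[0] can be built directly. The helper below is a sketch with our own names, assuming the (input, hidden…, output) layout of the formal description.

```python
def initial_architecture(d, k, all_init=False, max_hidden=3):
    # Build the initial size vector c[0] = (S_0, S_1, ..., S_{m+1}).
    # First-Init: one hidden layer with a single unit, the rest empty.
    # All-Init: every hidden layer starts with a single unit.
    hidden = [1] * max_hidden if all_init else [1] + [0] * (max_hidden - 1)
    return [d] + hidden + [k]

# Example: the Wine data set (13 features, 3 classes) under First-Init
# gives c[0] = [13, 1, 0, 0, 3]; under All-Init, [13, 1, 1, 1, 3].
```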

Learning
The following section outlines the learning process and the algorithms for each of the growing approaches. To measure network performance, the categorical cross-entropy loss function (5) is calculated at the end of each epoch as a measure of error, with the addition of a weight penalty from L2 regularization (ridge regression).
where ŷ_k^q is the k-th scalar value output from the network for example q, y_k^q is the corresponding target value, N is the total number of classes, M is the total number of instances in the data, d is the number of input features, and λ is the regularization parameter, tuned to 0.001. At each training epoch t, R batch iterations of batch stochastic gradient descent with momentum are computed to update the weights W^ℓ. At each batch iteration iter (iter = 1 to R), the momentum, an exponential moving average, is calculated according to Eq (6), m^[t] = β m^[t-1] + (1-β) g^[t], where m^[t] is the velocity vector, β is the momentum term (friction coefficient) set at 0.9, and g^[t] is the gradient at time t. The weights are then updated according to Eq (7), W^[t+1] = W^[t] - η m^[t], where η is the learning rate, set to 0.01. The sigmoid activation function (8) is used, with the softmax function (9) used to scale the final output of the network and generate interrelated probabilities between 0 and 1 for multiclass classification.
where h_j^ℓ is the hidden layer output from unit j of layer ℓ, obtained in the usual way (for details, see [63]), and h_j^{ℓ,in} is its corresponding activation.
ŷ_u = e^(y_u^in) / Σ_{n=1}^{N} e^(y_n^in), where ŷ is the categorical probabilistic output vector that sums to 1 and y^in is the input vector to the output layer. The denominator is a normalization term across all classes (N).
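The per-epoch learning step described above can be summarized in a short NumPy sketch. It implements categorical cross-entropy with an L2 penalty, exponential-moving-average momentum, the sigmoid, and a numerically stabilized softmax, using the reported hyperparameter values; the exact normalization of the penalty term and all names are our assumptions, not the authors' code.

```python
import numpy as np

def loss(Y, Y_hat, weights, lam=0.001):
    # Categorical cross-entropy with an L2 (ridge) penalty, Eq (5)-style;
    # the exact normalization of the penalty term is assumed.
    M = Y.shape[0]                                   # number of instances
    ce = -np.sum(Y * np.log(Y_hat + 1e-12)) / M
    l2 = lam * sum(np.sum(W ** 2) for W in weights)
    return ce + l2

def momentum_step(W, m, grad, beta=0.9, eta=0.01):
    # EMA momentum (Eq (6)-style) followed by the weight update
    # (Eq (7)-style), with the reported beta and eta.
    m = beta * m + (1.0 - beta) * grad
    return W - eta * m, m

def sigmoid(h_in):
    return 1.0 / (1.0 + np.exp(-h_in))               # Eq (8)

def softmax(y_in):
    e = np.exp(y_in - np.max(y_in))                  # stabilized Eq (9)
    return e / e.sum()
```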
With all approaches, training continues until one of the following conditions is met (a minimal sketch of this check follows the list):
1. The maximum number of training epochs is reached, Max_t = 100,000.
2. The training categorical cross-entropy falls below the convergence threshold, Min_E = 0.01.
3. The difference in testing accuracy over a 1,000-epoch window is less than the testing accuracy threshold, θ = 0.01, and the testing accuracy from the last epoch is >0.8 (80%).
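A sketch of this stopping check, with our own names and the thresholds listed above:

```python
def should_stop(t, train_loss, test_acc_history,
                max_t=100_000, min_E=0.01, theta=0.01, window=1000):
    # Condition 1: maximum number of training epochs reached.
    if t >= max_t:
        return True
    # Condition 2: training cross-entropy below the convergence threshold.
    if train_loss < min_E:
        return True
    # Condition 3: testing accuracy has plateaued above 80%.
    if len(test_acc_history) > window:
        recent, past = test_acc_history[-1], test_acc_history[-1 - window]
        if abs(recent - past) < theta and recent > 0.8:
            return True
    return False
```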

Growing algorithms
As previously explained in the architecture section, four potential network variants for each growing approach are explored. Consequently, each algorithm is composed of two essential components: its initialization condition (referred to as the All Init variant, Part 1a, or the First Init variant, Part 1b) and its weight update condition (referred to as the UAW variant, Part 2a, or the F variant, Part 2b). The specific details of the growing algorithms for both the parallel and sequential approaches are outlined below.

Parallel growth. Part 1a, Initialization (All Init variant): all hidden layers are initialized, with t = 0 (starting epoch), Max_ℓ = 3 (maximum number of hidden layers), Init_ℓ = 3 (initial number of hidden layers), and C = 100 (maximum number of hidden units a layer can sustain). The network topology is initialized, the weights are randomly initialized from a uniform distribution ranging from -0.1 to 0.1, and the growth rates of the hidden layers are initialized. Under the F variant, only the incoming weights to the active layer are updated and all others are frozen, with Activ_ℓ = 1 as the initial active layer. Part 1b, Initialization (First Init variant): identical, except only the first hidden layer is initialized (Init_ℓ = 1). Part 2, Learning and growing: at each epoch, the loss and the growth rate of each hidden layer are calculated, along with the training and testing accuracies (TrainAcc^[t] and TestAcc^[t]); under the F variant, only the incoming weights to the active layer are updated.

Sequential growth and DNC growth. Parts 1a and 1b mirror the parallel case, with the addition of Current_ℓ = 1, the starting layer for sequential growth. Part 2a, Learning and sequential growing (UAW variant): at each epoch, the loss and growth rate are calculated, and growth advances to the next layer only once the current layer has reached the carrying capacity. Part 2b, Learning and sequential growing (F variant): as in Part 2a, except only the incoming weights to the active layer are updated and all others are frozen.
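Putting the pieces together, a high-level sketch of one training run under parallel growth might look like the following. It reuses the hypothetical helpers sketched earlier; the network object and its methods (train_one_epoch, layer_size, add_unit) are illustrative stand-ins, not the authors' implementation.

```python
def train_parallel(net, C=100, max_layers=3):
    # Parallel growing: every hidden layer accumulates its own growth
    # and may add a unit at any epoch, up to the carrying capacity C.
    accum = [0.0] * max_layers        # fractional growth accrued per layer
    t, test_acc_history = 0, []
    while not should_stop(t, net.loss, test_acc_history):
        net.train_one_epoch()         # R batches of SGD with momentum
        E = net.loss                  # global error modulates growth
        for l in range(max_layers):
            h = net.layer_size(l)
            if l == 0:
                accum[l] += growth_rate_first_layer(h, E, C)
            else:
                accum[l] += growth_rate_deeper_layer(
                    h, net.layer_size(l - 1), E, C)
            if accum[l] >= 1.0 and h < C:   # a whole unit has accrued
                net.add_unit(layer=l)        # random in/out weights
                accum[l] -= 1.0
        test_acc_history.append(net.test_accuracy)
        t += 1
    return net
```

A sequential version would differ only in restricting the growth-rate update to Current_ℓ and advancing to the next layer once the carrying capacity is reached.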

Experiments and results
The proposed parallel growing algorithm was compared to the sequential version of our growing algorithm as well as DNC in the MLP on three benchmark classification data sets: the Breast Cancer Wisconsin (Diagnostic; BCW) data set [65], the Wine data set [66], and the Fashion MNIST data set [67]. These data sets vary in the number of features, classes, and instances (see Table 1). For a more detailed comparison, four variants for each approach (Parallel, Sequential, and DNC) were considered. These variants were based on the number of layers pre-initialized, one (First Init) or all (All Init), and the type of weight update, updating all weights (UAW) or just the incoming connections to the active layer and freezing the others (F).
The results of the binary classification of the BCW data set are shown in Table 2. With the All Init-UAW variant, all approaches had similar training and testing performances. Slight variations in network size were observed, with the DNC approach growing fewer units than the parallel approach. With the First Init-UAW variant, all approaches were consistent in terms of size, performance, and training epochs. Since only one hidden layer was initialized and the task was relatively easy for the network, fewer hidden units were required. The parallel approach behaved similarly to the sequential approach, with no additional hidden layers grown. Under the First Init-F variant, similar training and testing performances were observed across all approaches, with DNC showing smaller variations. Interestingly, the parallel approach grew three layers and approximately 21 hidden units. Conversely, the sequential and DNC approaches only required one hidden layer and approximately 5 units. Notably, the sequential approach also took considerably fewer training epochs, with a smaller standard deviation. Finally, for the All Init-F variant, the parallel approach achieved near-perfect training and testing performances, outperforming both the sequential and DNC approaches, which reached about 60% and 50%, respectively. Additionally, both the sequential and DNC approaches reached the maximum number of training epochs, whereas the parallel approach required considerably fewer epochs. On average, the parallel approach grew smaller (three layers and about 140.6 units) than the sequential approach (three layers and about 292.9 units) but larger than DNC (three layers and about 102.0 units). Notably, due to the window width limitation for DNC (set at 1,000 epochs) and the maximum of 100 hidden units per layer, DNC could not grow units past the first layer.
The three-class Wine data set showed similar results to the BCW (see Table 3). For the All Init-UAW variant, all approaches had similar average accuracies. However, the sequential approach had a large standard deviation of about 6%. In terms of average size, DNC was the smallest (24 units), followed by the parallel approach (74.9 units) and the sequential approach (101.7 units). Despite not being the smallest, the parallel approach took noticeably fewer training epochs and had less variation. For the First Init-UAW variant, all approaches were almost identical in size, each having approximately one hidden unit. Similar to the BCW data set, as only one hidden layer was initialized and the task was less complex, fewer hidden units were required. The parallel approach again behaved similarly to the sequential approach, with no additional hidden layers grown. Performance-wise, all approaches were comparable, with DNC having a larger standard deviation of about 4%. A noteworthy observation is the difference in training epochs: the parallel approach, on average, needed 110 training epochs, while the other approaches needed over 2,000. For the First Init-F variant, the parallel approach grew larger (three hidden layers with an average of 166.8 hidden units) compared to the other approaches, which only used one hidden layer (sequential: 23.2 hidden units; DNC: 5.8 hidden units). Despite the parallel approach having, on average, 2% higher accuracies, it took more than double the training epochs. For the final variant, the All Init-F variant, the parallel and sequential approaches had similar performances. Nevertheless, the sequential method grew to the maximum of 100 units per layer (300 units total) and required an average of over 90,000 training epochs. Comparatively, the parallel approach utilized only 248.3 units with an average of 18,717.1 epochs. DNC faced the same limitation, unable to grow beyond the first layer, resulting in poor performances and reaching the maximum number of training epochs.

The most demanding data set was the ten-class Fashion MNIST (see Table 4). For the All Init-UAW variant, the parallel approach had approximately 40% better average training and testing accuracies compared to both the sequential and DNC approaches. The parallel approach also grew, on average, substantially smaller (22.6 hidden units) compared to the other approaches (sequential: 259.2 hidden units; DNC: 83.0 hidden units) and took only 5,893 epochs to train (about 61,000-94,000 epochs less). For the First Init-UAW variant, both the parallel and sequential approaches had similar sizes, performances, and training epochs on average. Conversely, DNC grew smaller (about three fewer hidden units) and had a lower training accuracy (about 5% less) but took significantly more epochs. For the First Init-F variant, the parallel approach had the highest performance. However, on average, the parallel approach also grew significantly larger (three hidden layers with 215 hidden units) compared to the other two approaches (sequential: 51.6 hidden units; DNC: 24.8 hidden units). For the final variant, the All Init-F, the parallel approach vastly outperformed the other approaches, achieving approximately 40% higher training and testing accuracies than the sequential approach. It also took fewer training epochs and grew substantially smaller (170 vs.
300 hidden units). The DNC approach encountered the same limitation and could not grow past the first layer, resulting in poor performances and reaching the maximum number of training epochs. Regarding the variants, initializing only one hidden layer (First Init) yielded smaller topologies, faster training times, and similar performances across all tasks compared to initializing all the hidden layers (All Init). Similarly, updating all weights (UAW) yielded smaller topologies, faster training times, and similar performances across all tasks compared to the freezing approach, where only the active layer's weights were updated (F). Overall, the parallel approach rivals its sequential counterpart. On the two- and three-class problems, the parallel approach demonstrated comparable performances, often accompanied by fewer training epochs. However, the improved training efficiency tended to coincide with larger architectures. On the more challenging 10-class problem, the parallel approach rivaled or outperformed the sequential approaches. The only exception was the First Init-F variant, where the parallel approach resulted in larger structures and longer training times. Specifically, for both All Init variants, the parallel approach grew smaller topologies, achieved higher accuracies, and took fewer training epochs. For the First Init-UAW variant, the parallel approach performed equivalently to the sequential approach, as no additional hidden layers were needed to perform the task.

Discussion
This study focused on the challenge of when a new hidden unit or layer should be added when using a constructive algorithm. We proposed investigating the effects of growing sequentially or in parallel in a multilayer context. Sequential growth was characterized by only growing one hidden layer at a time. In contrast, parallel growth was characterized by growing all hidden layers simultaneously. To achieve this, we created a modified version of our population dynamics-inspired growing algorithm capable of growing in parallel. Several variants of these approaches, based on the hidden layer initialization and the training scheme employed, were used to ensure a more comprehensive comparison. The sequential and parallel approaches were tested on several benchmark classification tasks. Another sequential growing approach, DNC, was also compared, as it uses the standard methodology of iteratively adding units when the error curve flattens within a specific time frame.
The results highlight two scenarios where parallel growing is more beneficial than sequential. The first is realized by comparing to the DNC approach, which iteratively adds units when the error curve flattens within a specific time frame. The All Init-F variant highlighted the limitation of a constructive approach that relies heavily on fine-tuned hyperparameters. While a smaller trigger slope and a larger window width can lead to smaller topologies, they can prevent the network from sequentially growing multiple layers in a reasonable time frame. In this case, with a window width of 1,000 epochs and up to 100 hidden units per layer, filling a single layer can consume the entire budget of 100,000 training epochs, making it impossible for DNC to grow depth if the task requires it. If the hyperparameters are not finely tuned to the specific task, the resulting architecture can grow larger than the minimal number of units needed, or the network can be prevented from growing beyond the minimal amount [35]. Secondly, for the more challenging 10-class problem, when multiple hidden layers were initialized, the parallel growing approach surpassed the sequential approaches, offering smaller layer widths, faster training times, and higher performances.
The possibility of the resulting architecture being larger than required for a given task can be considered a limitation of growing in parallel. Increasing network complexity beyond what is required can lead to overfitting [11]. A possible solution to this problem is to create a hybrid growing-pruning approach by further developing the population dynamics-inspired approach to include pruning. Network pruning offers a way to systematically reduce the structural complexity of the network. One way to achieve this would be to have a negative growth rate decay the neuronal population toward an optimal minimal topology. Including a way to prune the network would also allow comparisons to more state-of-the-art approaches, as hybrid methods appear to be more suitable for searching for optimal architectures [16].
In our population dynamics-inspired growing algorithm, the global network error is used as performance feedback that modulates the growth rate of the hidden unit population. As the error converges toward zero, the growth rate likewise converges toward zero. This property gives rise to the algorithm's ability to grow near-optimal architectures based on task complexity. One potential issue is that the network error may never converge to a minimal value due to noise in the data, resulting in a continuously growing architecture. An alternative might be to use the network bias and variance as a built-in metric for scaling growing and pruning, such as by calculating network significance [58,59]. Essentially, this could act as a form of early stopping for the growing process.
The focus of this study was not to obtain the best possible performance but to compare sequential and parallel growing methods. As such, no preprocessing methods or modifications to the data sets were applied. Preprocessing techniques and processes such as data augmentation can introduce confounding variables, making meaningful comparisons between models challenging [22]. However, preprocessing allows the use of more real-world data sets and can lead to faster learning and higher classification accuracies [68]. An interesting extension for comparing sequential and parallel growing would be to implement the constructive algorithm in a CNN to learn larger coloured images, such as CIFAR-10. CNNs are composed of convolutional and pooling layers responsible for feature extraction, followed by a fully-connected layer responsible for classifying the image. Replacing the pre-designed architecture of the fully-connected layer with a constructive algorithm gives rise to a more adaptive CNN. Mohamed et al. [69] replaced the fully-connected layer in a CNN with their cascade-correlation growing deep learning neural network (CCG-DLNN) algorithm to successfully classify lung cancer images. Given the success of a sequential growing approach in a CNN, the question arises: could a parallel growing approach be beneficial in reducing the time spent growing?

Conclusion
The decision of when to add a new hidden unit or layer is a fundamental challenge for constructive algorithms. It becomes even more complex in the context of multiple hidden layers. The application and testing of parallel growing methods, in general, merit further investigation.

Fig 1. A sample of sequential growing dynamics for a three-hidden layer MLP. a) Sequential growth rate of three hidden layers across training epochs. b) Changes in hidden layer size across three hidden layers with a carrying capacity (C) of 100 during training. https://doi.org/10.1371/journal.pone.0301513.g001

Fig 2. The effect of hidden layer size on the growth rate of subsequent layers, where C is the carrying capacity and α is the extinction constant. a) Initial staggered growth rate of hidden layers, with deeper layers growing at a slower rate. b) Increase in the second layer's growth rate as a response to the first hidden layer almost reaching the carrying capacity. https://doi.org/10.1371/journal.pone.0301513.g002

Fig 3. A sample of parallel growing dynamics for a three-hidden layer MLP. a) Parallel growth rate of three hidden layers across training epochs. b) Changes in hidden layer size across three hidden layers with a carrying capacity (C) of 100 during training. https://doi.org/10.1371/journal.pone.0301513.g003

Fig 4. MLP architecture. In the above architecture, x is a vector of length d, where d is the dimension of the input vector; s is the current number of hidden units in a layer; b denotes the biases; and the dashed lines and dashed circles are weight connections and hidden units that are incrementally added during growing. https://doi.org/10.1371/journal.pone.0301513.g004

Fig 5. Network initialization conditions. a) Only one hidden layer initialized with a single hidden unit. b) All three hidden layers initialized with a single hidden unit. https://doi.org/10.1371/journal.pone.0301513.g005

Fig 6. Weight update conditions. a) All network weights can be updated as the network grows (solid black lines). b) Only the incoming weights to the layer that is actively growing can be updated (solid black lines); all other weights in the network are frozen (light grey lines). In this example, the first hidden layer is initially active; then the second hidden layer grows a new unit and becomes the active layer. https://doi.org/10.1371/journal.pone.0301513.g006

Table 2. Average results of the parallel, sequential, and DNC variants-Breast Cancer Wisconsin.
Averages were calculated across 10 trials with the std. dev. reported. https://doi.org/10.1371/journal.pone.0301513.t002

Table 3. Average results of the parallel, sequential, and DNC variants-Wine.
Averages were calculated across 10 trials with the std. dev. reported.

Table 4. Average results of the parallel, sequential, and DNC variants-Fashion MNIST.
Averages were calculated across 10 trials with the std. dev. reported.